Abstract

In this paper, we propose a deep neural network architecture for object recognition
based on recurrent neural networks. The proposed network, called ReNet,
replaces the ubiquitous convolution+pooling layer of the deep convolutional neural
network with four recurrent neural networks that sweep horizontally and vertically
in both directions across the image. We evaluate the proposed ReNet on
three widely-used benchmark datasets: MNIST, CIFAR-10 and SVHN. The results
suggest that ReNet is a viable alternative to the deep convolutional neural
network, and that further investigation is needed.


1 Introduction 


Convolutional neural networks [CNN, Fukushima, 1980, LeCun et al., 1989] have become the
method of choice for object recognition [see, e.g., Krizhevsky et al., 2012]. They have proved
to be successful at a variety of benchmark problems including, but not limited to, handwritten digit
recognition [see, e.g., Ciresan et al., 2012b], natural image classification [see, e.g., Lin et al., 2014,
Simonyan and Zisserman, 2015, Szegedy et al., 2014], house number recognition [see, e.g.,
Goodfellow et al., 2014], traffic sign recognition [see, e.g., Ciresan et al., 2012a], as well as for speech
recognition [see, e.g., Abdel-Hamid et al., 2012, Sainath et al., 2013, Toth, 2014]. Furthermore,
image representations from CNNs trained to recognize objects on a large set of more than one million
images [Simonyan and Zisserman, 2015, Szegedy et al., 2014] have been found to be extremely
helpful in performing other computer vision tasks such as image caption generation [see, e.g., Vinyals
et al., 2014, Xu et al., 2015], video description generation [see, e.g., Yao et al., 2015] and object
localization/detection [see, e.g., Sermanet et al., 2014].

While the CNN has been especially successful in computer vision, recurrent neural networks (RNN)
have become the method of choice for modeling sequential data, such as text and sound. Natural
language processing (NLP) applications include language modeling [see, e.g., Mikolov, 2012], and
machine translation [Sutskever et al., 2014, Cho et al., 2014, Bahdanau et al., 2015]. Other popular
areas of application include offline handwriting recognition/generation [Graves and Schmidhuber,
2009, Graves et al., 2008, Graves, 2013] and speech recognition [Chorowski et al., 2014, Graves and
Jaitly, 2014]. RNNs have also been used together with CNNs in speech recognition [Sainath et al.,
2015]. The recent revival of RNNs has largely been due to advances in learning algorithms [Pascanu
et al., 2013, Martens and Sutskever, 2011] and model architectures [Pascanu et al., 2014, Hochreiter
and Schmidhuber, 1997, Cho et al., 2014].


* Equal contribution 
The architecture proposed here is related to and inspired by this earlier work, but our model relies on
purely uni-dimensional RNNs coupled in a novel way, rather than on a multi-dimensional RNN.
The basic idea behind the proposed ReNet architecture is to replace each convolutional layer (with
convolution+pooling making up a layer) in the CNN with four RNNs that sweep over lower-layer
features in different directions: (1) bottom to top, (2) top to bottom, (3) left to right and (4) right
to left. The recurrent layer ensures that each feature activation in its output is an activation at
the specific location with respect to the whole image, in contrast to the usual convolution+pooling
layer which only has a local context window. The lowest layer of the model sweeps over the input
image, with subsequent layers operating on extracted representations from the layer below, forming
a hierarchical representation of the input.


Graves and Schmidhuber [2009] have demonstrated an RNN-based object recognition system for
offline Arabic handwriting recognition. The main difference between ReNet and the model of Graves
and Schmidhuber [2009] is that we use the usual sequence RNN, instead of the multidimensional
RNN. We make the latter two parts of a single layer, namely two (horizontal) RNNs or one (horizontal)
bidirectional RNN, work on the hidden states computed by the first two (vertical) RNNs, or one
(vertical) bidirectional RNN. This allows us to use a plain RNN, instead of the more complex
multidimensional RNN, while making each output activation of the layer be computed with respect to
the whole input image.


One important consequence of the proposed approach compared to the multidimensional RNN is
that the number of RNNs at each layer now scales linearly with respect to the number of dimensions
$d$ of the input image ($2d$). A multidimensional RNN, on the other hand, requires an exponential
number of RNNs at each layer ($2^d$). Furthermore, the proposed variant is more easily parallelizable,
as each RNN is dependent only along a horizontal or vertical sequence of patches. This architectural
distinction results in our model being much more amenable to distributed computing than that of
Graves and Schmidhuber [2009].
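
To make the scaling argument concrete, the short sketch below tabulates the number of RNNs per layer under the two counts stated above ($2d$ for ReNet, $2^d$ for a multidimensional RNN). The function names are illustrative only and do not come from any implementation.

```python
def renet_rnns_per_layer(d: int) -> int:
    """RNNs per ReNet layer: two sweep directions for each of the d axes."""
    return 2 * d


def mdrnn_rnns_per_layer(d: int) -> int:
    """RNNs per multidimensional-RNN layer, using the 2^d count from the text."""
    return 2 ** d


# Print d, the ReNet count and the multidimensional count side by side.
for d in (1, 2, 3, 4):
    print(d, renet_rnns_per_layer(d), mdrnn_rnns_per_layer(d))
```

For ordinary two-dimensional images both counts equal four; the difference only appears for inputs with three or more dimensions, such as video or volumetric data.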

In this work, we test the proposed ReNet on several widely used object recognition benchmarks,
namely MNIST [LeCun et al., 1999], CIFAR-10 [Krizhevsky and Hinton, 2009] and SVHN [Netzer
et al., 2011]. Our experiments reveal that the model performs comparably to convolutional neural
networks on all these datasets, suggesting the potential of RNNs as a competitive alternative to
CNNs for image related tasks.


2 Model Description 


Let us denote by $X = \{x_{i,j}\}$ the input image or the feature map
from the layer below, where $X \in \mathbb{R}^{w \times h \times c}$ with $w$, $h$ and $c$ the
width, height and number of channels, or the feature dimensionality,
respectively. Given a receptive field (or patch) size of $w_p \times h_p$, we
split the input image $X$ into a set of $I \times J$ (non-overlapping) patches
$P = \{p_{i,j}\}$, where $I = w / w_p$, $J = h / h_p$ and
$p_{i,j} \in \mathbb{R}^{w_p \times h_p \times c}$ is the
$(i, j)$-th patch of the input image. The first index $i$ is the horizontal
index and the other index $j$ is the vertical index.
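
The patch split can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes the image is stored as an array of shape (w, h, c), matching the index order used in the text, and that the patch size divides the image size exactly. The function name `split_into_patches` is hypothetical.

```python
import numpy as np


def split_into_patches(x: np.ndarray, wp: int, hp: int) -> np.ndarray:
    """Split x of shape (w, h, c) into non-overlapping, flattened patches.

    Returns an array of shape (I, J, wp * hp * c) with I = w // wp, J = h // hp.
    """
    w, h, c = x.shape
    if w % wp != 0 or h % hp != 0:
        raise ValueError("patch size must divide the image size")
    I, J = w // wp, h // hp
    # (w, h, c) -> (I, wp, J, hp, c) -> (I, J, wp, hp, c) -> (I, J, wp*hp*c)
    patches = x.reshape(I, wp, J, hp, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(I, J, wp * hp * c)


# Toy input: w = 4, h = 6, c = 3, split into 2 x 2 patches.
x = np.arange(4 * 6 * 3, dtype=np.float64).reshape(4, 6, 3)
p = split_into_patches(x, wp=2, hp=2)
print(p.shape)
```

Each entry `p[i, j]` is the flattened patch that a sweeping RNN would receive as its input at grid position (i, j).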

First, we sweep the image vertically with two RNNs, with one RNN
working in a bottom-up direction and the other working in a top-down
direction. Each RNN takes as an input one (flattened) patch
at a time and updates its hidden state, working along each column
$i$ of the split input image $X$.

$$v^F_{i,j} = f_{\mathrm{VFWD}}(z^F_{i,j-1}, p_{i,j}), \quad \text{for } j = 1, \cdots, J \tag{1}$$

$$v^R_{i,j} = f_{\mathrm{VREV}}(z^R_{i,j+1}, p_{i,j}), \quad \text{for } j = J, \cdots, 1 \tag{2}$$


Note that $f_{\mathrm{VFWD}}$ and $f_{\mathrm{VREV}}$ return the activation of the recurrent
hidden state, and may be implemented either as a simple tanh layer,
as a gated recurrent layer [Cho et al., 2014] or as a long short-term
memory layer [Hochreiter and Schmidhuber, 1997].



Figure 1: A one-layer ReNet

After this vertical, bidirectional sweep, we concatenate the intermediate hidden states $v^F_{i,j}$ and
$v^R_{i,j}$ at each location $(i, j)$ to get a composite feature map
$V = \{v_{i,j}\}_{i=1,\dots,I}^{j=1,\dots,J}$, where $v_{i,j} \in \mathbb{R}^{2d}$ and $d$
is the number of recurrent units. Each $v_{i,j}$ is now the activation of a feature detector at the location
$(i, j)$ with respect to all the patches in the $i$-th column of the original input ($p_{i,j'}$ for all $j'$).


Next we sweep over the obtained feature map $V$ horizontally with two RNNs ($f_{\mathrm{HFWD}}$ and
$f_{\mathrm{HREV}}$). In a similar manner as the vertical sweep, these RNNs work along each row of $V$,
resulting in the output feature map $H = \{h_{i,j}\}$, where $h_{i,j} \in \mathbb{R}^{2d}$. Now, each
vector $h_{i,j}$ represents the features of the original image patch $p_{i,j}$ in the context of the whole image.
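
The two sweeps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it uses the simple tanh recurrence (the text also allows GRU or LSTM units), a zero initial hidden state, patches already flattened into an array of shape (I, J, n_in), and small random weights. All function and parameter names (`rnn_sweep`, `bidirectional_sweep`, `renet_layer`, `init_params`) are hypothetical.

```python
import numpy as np


def rnn_sweep(seq, Wx, Wh, b, reverse=False):
    """Run a tanh RNN over seq of shape (T, n_in); return states of shape (T, d)."""
    T, d = seq.shape[0], b.shape[0]
    h = np.zeros(d)                      # assumed zero initial state
    out = np.zeros((T, d))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(seq[t] @ Wx + h @ Wh + b)
        out[t] = h
    return out


def bidirectional_sweep(grid, params_fwd, params_rev):
    """Sweep along axis 1 of grid (A, B, n_in) in both directions.

    Returns shape (A, B, 2 * d): forward and reverse states concatenated.
    """
    rows = []
    for a in range(grid.shape[0]):
        fwd = rnn_sweep(grid[a], *params_fwd)
        rev = rnn_sweep(grid[a], *params_rev, reverse=True)
        rows.append(np.concatenate([fwd, rev], axis=-1))
    return np.stack(rows)


def renet_layer(patches, vert_fwd, vert_rev, horiz_fwd, horiz_rev):
    """One ReNet layer: vertical sweep over the patches, then horizontal sweep over V."""
    # Vertical sweep: for each horizontal index i, run along the vertical index j.
    V = bidirectional_sweep(patches, vert_fwd, vert_rev)             # (I, J, 2d)
    # Horizontal sweep: for each j, run along i (transpose in, transpose back).
    H = bidirectional_sweep(V.transpose(1, 0, 2), horiz_fwd, horiz_rev)
    return H.transpose(1, 0, 2)                                      # (I, J, 2d)


def init_params(rng, n_in, d):
    """Hypothetical small random initialisation for one RNN."""
    return (0.1 * rng.standard_normal((n_in, d)),
            0.1 * rng.standard_normal((d, d)),
            np.zeros(d))


rng = np.random.default_rng(0)
I, J, n_in, d = 3, 4, 12, 5
patches = rng.standard_normal((I, J, n_in))
params = [init_params(rng, n_in, d), init_params(rng, n_in, d),
          init_params(rng, 2 * d, d), init_params(rng, 2 * d, d)]
H = renet_layer(patches, *params)
print(H.shape)
```

Because the horizontal RNNs read the concatenated vertical states, every output vector depends on every input patch, which is the whole-image context described above. Stacking layers amounts to feeding `H` back in as the next layer's input, after an optional further patch split.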


Let us denote by f the function from the input map X to the output feature map H (see
Fig. 1 for a graphical illustration). Clearly, we can stack multiple f's to make the proposed ReNet
deeper and capture increasingly complex features of the input image. After any number of recurrent
layers are applied to an input image, the activation at the last recurrent layer may be flattened and fed
into a differentiable classifier. In our experiments we used several fully-connected layers followed
by a softmax classifier (as shown in Fig. 2).


The deep ReNet is a smooth, continuous function, and the parameters (those from the RNNs as well
as from the fully-connected layers) can be estimated by the stochastic gradient descent algorithm
with the gradient computed by the backpropagation algorithm [see, e.g., Rumelhart et al., 1986] to
maximize the log-likelihood.


3 Differences between LeNet and ReNet 


There are many similarities and differences between the proposed ReNet and a convolutional neural
network. In this section we use LeNet to refer to the canonical convolutional neural network as
shown by LeCun et al. [1989]. Here we highlight a few key points of comparison between ReNet
and LeNet.


At each layer, both networks apply the same set of filters to patches of the input image or of the
feature map from the layer below. ReNet, however, propagates information through lateral
connections that span across the whole image, while LeNet exploits local information only. The lateral
connections should help extract a more compact feature representation of the input image at each
layer, which can be accomplished by the lateral connections removing/resolving redundant features
at different locations of the image. This should allow ReNet to resolve small displacements of features
across multiple consecutive patches.

LeNet max-pools the activations of each filter over a small region to achieve local translation
invariance. In contrast, the proposed ReNet does not use any pooling due to the existence of learned lateral
connections. The lateral connection in ReNet can emulate the local competition among features
induced by the max-pooling in LeNet. This does not mean that it is not possible to use max-pooling
in ReNet. The use of max-pooling in the ReNet could be helpful in reducing the dimensionality of
the feature map, resulting in lower computational cost.

Max-pooling as used in LeNet may prove problematic when building a convolutional autoencoder
whose decoder is an inverse* of LeNet, as the max operator is not invertible. The proposed ReNet is
end-to-end smooth and differentiable, making it more suited to be used as a decoder in the
autoencoder or any of its probabilistic variants [see, e.g., Kingma and Welling, 2014].

In some sense, each layer of the ReNet can be considered as a variant of a usual convolution+pooling
layer, where pooling is replaced with lateral connections, and convolution is done without any
overlap. Similarly, Springenberg et al. [2014] recently proposed a variant of a usual LeNet which does
not use any pooling. They used convolution with a larger stride to compensate for the lack of
dimensionality reduction by pooling at each layer. However, this approach still differs from the proposed
ReNet in the sense that each feature activation at a layer is only with respect to a subset of the input
image rather than the whole input image.

The main disadvantage of ReNet is that it is not easily parallelizable, due to the sequential nature
of the recurrent neural network (RNN). LeNet, on the other hand, is highly parallelizable due to
the independence of computing activations at each layer. The introduction of sequential, lateral
connections, however, may result in more efficient parametrization, requiring a smaller number of
parameters with overall fewer computations, although this needs to be further explored. We note
that this limitation on parallelization applies only to model parallelism, and any technique for data
parallelism may be used for both the proposed ReNet and the LeNet.

* All the forward arrows from the input to the output in the original LeNet are reversed.


4 Experiments 

4.1 Datasets 


We evaluated the proposed ReNet on three widely-used benchmark datasets: MNIST, CIFAR-10
and the Street View House Numbers (SVHN). In this section we describe each dataset in detail.


MNIST The MNIST dataset [LeCun et al., 1999] consists of 70,000 handwritten digits from 0 to
9, centered on a 28 × 28 square canvas. Each pixel represents the grayscale in the range of [0, 255].†
We split the dataset into 50,000 training samples, 10,000 validation samples and 10,000 test samples,
following the standard split.


CIFAR-10 The CIFAR-10 dataset [Krizhevsky and Hinton, 2009] is a curated subset of the 80
million tiny images dataset, originally released by Torralba et al. [2008]. CIFAR-10 contains 60,000
images each of which belongs to one of ten categories: airplane, automobile, bird, cat, deer, dog,
frog, horse, ship and truck. Each image is 32 pixels wide and 32 pixels high with 3 color channels
(red, green and blue). Following the standard procedure, we split the dataset into 40,000 training,
10,000 validation and 10,000 test samples. We applied zero-phase component analysis (ZCA) and
normalized each pixel to have zero-mean and unit-variance across the training samples, as suggested
by Krizhevsky and Hinton [2009].


Street View House Numbers The Street View House Numbers (SVHN) dataset [Netzer et al.,
2011] consists of cropped images representing house numbers captured by Google Street View
vehicles as a part of the Google Maps mapping process. These images consist of digits 0 through 9
with values in the range of [0, 255] in each of 3 red-green-blue color channels. Each image is 32
pixels wide and 32 pixels high, giving a sample dimensionality of (32, 32, 3). The numbers of samples
we used for the training, validation and test sets are 543,949, 60,439 and 26,032, respectively. We normalized
each pixel to have zero-mean and unit-variance across the training samples.
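
The per-pixel normalization can be sketched as below. This is an illustrative NumPy example only: the random arrays stand in for real images, the small constant `eps` is an added assumption to avoid division by zero, and the ZCA step used for CIFAR-10 is not shown. The function names are hypothetical.

```python
import numpy as np


def fit_pixel_stats(train):
    """Per-pixel mean and standard deviation over the sample axis (axis 0)."""
    return train.mean(axis=0), train.std(axis=0)


def normalize(x, mean, std, eps=1e-8):
    """Standardize x with statistics estimated on the training samples only."""
    return (x - mean) / (std + eps)


rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(200, 8, 8, 3))   # stand-in for training images
test = rng.uniform(0, 255, size=(10, 8, 8, 3))     # stand-in for test images

mean, std = fit_pixel_stats(train)
train_n = normalize(train, mean, std)
test_n = normalize(test, mean, std)                # reuse the training statistics
print(train_n.shape, test_n.shape)
```

The statistics are computed once on the training samples and then reused for the validation and test samples, so that no information from held-out data enters the preprocessing.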


4.1.1 Data Augmentation 


It has been known that augmenting training data often leads to better generalization [see, e.g.,
Krizhevsky et al., 2012]. We decided to employ two primary data augmentations in the following
experiments: flipping and shifting.


For flipping, we either flipped each sample horizontally with 25% chance, flipped it vertically with
25% chance, or left it unchanged. This lets the model observe “mirror images” of the original
image during training. In the case of shifting, we either shifted the image by 2 pixels to the left (25%
chance), 2 pixels to the right (25% chance) or left it as it was. After this first processing, we further
either shifted it by 2 pixels to the top (25% chance), 2 pixels to the bottom (25% chance) or left it as
it was. This two-step procedure makes the model more robust to slight shifting of an object in the
image. The shifting was done without padding the borders of the image, preserving the original size
but dropping the pixels which are shifted out of the input while shifting in zeros.
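
A possible reading of this augmentation procedure is sketched below. It is not the authors' code: the array layout (rows, columns, channels), the function names and the use of NumPy's random generator are assumptions, and the 50% probability of leaving a sample unchanged is inferred from the two stated 25% probabilities.

```python
import numpy as np


def shift_zero_fill(img, dx, dy):
    """Shift img (rows, cols, channels) by dx columns and dy rows, filling with zeros.

    Positive dx moves content right, positive dy moves it down. The output keeps
    the input shape; pixels shifted past the border are dropped.
    """
    out = np.zeros_like(img)
    rows, cols = img.shape[:2]
    src_r = slice(max(0, -dy), min(rows, rows - dy))
    dst_r = slice(max(0, dy), min(rows, rows + dy))
    src_c = slice(max(0, -dx), min(cols, cols - dx))
    dst_c = slice(max(0, dx), min(cols, cols + dx))
    out[dst_r, dst_c] = img[src_r, src_c]
    return out


def random_shift(img, rng, step=2):
    """Two-step shift: a horizontal choice first, then a vertical choice."""
    # Each direction has probability 0.25; no shift has probability 0.5.
    dx = rng.choice([-step, step, 0], p=[0.25, 0.25, 0.5])
    dy = rng.choice([-step, step, 0], p=[0.25, 0.25, 0.5])
    return shift_zero_fill(img, int(dx), int(dy))


def random_flip(img, rng):
    """Flip horizontally (25%), flip vertically (25%) or leave unchanged (50%)."""
    u = rng.random()
    if u < 0.25:
        return img[:, ::-1]   # horizontal flip: mirror the columns
    if u < 0.5:
        return img[::-1, :]   # vertical flip: mirror the rows
    return img


rng = np.random.default_rng(0)
img = np.arange(1, 17, dtype=np.float64).reshape(4, 4, 1)
shifted = shift_zero_fill(img, dx=2, dy=0)      # two pixels to the right
augmented = random_shift(random_flip(img, rng), rng)
print(shifted[..., 0])
```

In the deterministic example, the two leftmost columns of `shifted` are zero and the two rightmost columns hold what used to be the two leftmost columns of `img`.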


Whether to apply each of these augmentation procedures was decided separately for each dataset
so as to maximize validation performance.


4.2 Model Architectures 


Gated Recurrent Units Gated recurrent units [GRU, Cho et al., 2014] and long short-term
memory units [LSTM, Hochreiter and Schmidhuber, 1997] have been successful in many applications
using recurrent neural networks [see, e.g., Cho et al., 2014, Sutskever et al., 2014, Xu et al., 2015].

† We scaled each pixel by dividing it by 255.

Figure 2: The ReNet network used for SVHN classification



              MNIST              CIFAR-10                   SVHN
N_RE          2                  3                          3
w_p × h_p     [2 × 2]-[2 × 2]    [2 × 2]-[2 × 2]-[2 × 2]    [2 × 2]-[2 × 2]-[2 × 2]
d_RE          256-256            320-320-320                256-256-256
N_FC          2                  1                          2
d_FC          4096-4096          4096                       4096-4096
f_FC          max(0, x)          max(0, x)                  max(0, x)
Flipping      no                 yes                        no
Shifting      yes                yes                        yes


Table 1: Model architectures used in the experiments. Each row shows respectively the number of
ReNet layers, the size of the patches, the number of neurons of each ReNet layer, the number of
fully connected layers, the number of neurons of the fully connected layers, their activation function
and the data augmentation procedure employed.


To show that the ReNet model performs well independently of the specific implementation of the 
recurrent units, we decided to use the GRU on MNIST and CIFAR-10, with LSTM units on SVHN. 

The hidden state of the GRU at time $t$ is computed by

$$h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t,$$

where

$$\tilde{h}_t = \tanh\left(W x_t + U (r_t \odot h_{t-1}) + b\right)$$

and

$$[u_t; r_t] = \sigma\left(W_g x_t + U_g h_{t-1} + b_g\right).$$


For more details on the LSTM unit, as well as for an in-depth comparison among different recurrent
units, we refer the reader to Chung et al. [2015].
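
As an illustration of the GRU update given above, the following NumPy sketch computes a single step. The matrix shapes, the stacking of the update and reset gates into one vector, and the function name `gru_step` are assumptions made for the example; the symbols otherwise follow the equations in the text.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W, U, b, Wg, Ug, bg):
    """One GRU update.

    Assumed shapes (d hidden units, n inputs):
      W: (d, n), U: (d, d), b: (d,), Wg: (2d, n), Ug: (2d, d), bg: (2d,).
    """
    d = h_prev.shape[0]
    gates = sigmoid(Wg @ x_t + Ug @ h_prev + bg)          # stacked [u_t; r_t]
    u_t, r_t = gates[:d], gates[d:]
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev) + b)   # candidate state
    return (1.0 - u_t) * h_prev + u_t * h_tilde           # interpolate old and new


rng = np.random.default_rng(0)
n, d = 3, 4
x_t = rng.standard_normal(n)
h_prev = np.tanh(rng.standard_normal(d))
W, U, b = rng.standard_normal((d, n)), rng.standard_normal((d, d)), np.zeros(d)
Wg, Ug, bg = rng.standard_normal((2 * d, n)), rng.standard_normal((2 * d, d)), np.zeros(2 * d)

h_t = gru_step(x_t, h_prev, W, U, b, Wg, Ug, bg)
print(h_t.shape)
```

When the update gate $u_t$ is close to zero the previous state is copied almost unchanged, which is what lets the unit carry information across many patches of a sweep.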


General Architecture The principal parameters that define the architecture of the proposed ReNet
are the number of ReNet layers ($N_{RE}$), their corresponding receptive field sizes ($w_p \times h_p$) and feature
dimensionality ($d_{RE}$), the number of fully-connected layers ($N_{FC}$) and their corresponding numbers
($d_{FC}$) and types ($f_{FC}$) of hidden units.

In this introductory work, we did not focus on extensive hyperparameter search to find the optimal
validation set performance. We chose instead to focus the experiments on a small set of
hyperparameters, with the only aim to show the potential of the proposed model. Refer to Table 1 for a summary
of the settings that performed best on the validation set of the studied datasets and to Fig. 2 for a
graphical illustration of the model we selected for SVHN.


4.3 Training 


To train the networks we used a recently proposed adaptive learning rate algorithm, called
Adam [Kingma and Ba, 2014]. In order to reduce overfitting we applied dropout [Srivastava et al.,
2014] after each layer, including both the proposed ReNet layer (after the horizontal and vertical
sweeps) and the fully-connected layers. The input was also corrupted by masking out each variable
with probability 0.2. Finally, each optimization run was stopped early based on validation error.
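
The input corruption can be sketched as an independent mask per input variable. This is an illustrative NumPy example; whether the original implementation rescaled the surviving inputs is not stated, so no rescaling is applied here, and the function name `mask_input` is hypothetical.

```python
import numpy as np


def mask_input(x, rng, p=0.2):
    """Set each input variable to zero independently with probability p."""
    keep = rng.random(x.shape) >= p
    return x * keep


rng = np.random.default_rng(0)
x = np.ones((100, 100))
x_corrupted = mask_input(x, rng)
print(x_corrupted.mean())   # fraction of variables that were kept
```

Such a mask would be resampled for every training example and switched off at validation and test time.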

Note that unlike many previous works, we did not retrain the model (selected based on the validation
performance) using both the training and validation samples. This experiment design choice is
consistent with our declared goal to show a proof of concept rather than stressing absolute performance.
There are many potential areas of exploration for future work.


Test Error    Model
0.28%         Wan et al. [2013]*
0.31%         Graham [2014a]*
0.35%         Ciresan et al. [2010]
0.39%         Mairal et al. [2014]*
0.39%         Lee et al. [2014]*
0.4%          Simard et al. [2003]*
0.44%         Graham [2014b]*
0.45%         Goodfellow et al. [2013]*
0.45%         ReNet
0.47%         Lin et al. [2014]*
0.52%         Azzopardi and Petkov [2013]

(a) MNIST


Test Error    Model
4.5%          Graham [2014b]*
6.28%         Graham [2014a]*
8.8%          Lin et al. [2014]*
9.35%         Goodfellow et al. [2013]*
9.39%         Springenberg and Riedmiller [2013]
9.5%          Snoek et al. [2012]*
11%           Krizhevsky et al. [2012]*
11.10%        Wan et al. [2013]*
12.35%        ReNet
15.13%        Zeiler and Fergus [2013]*
15.6%         Hinton et al. [2012]*

(b) CIFAR-10


Test Error    Model
1.92%         Lee et al. [2014]*
2.23%         Wan et al. [2013]*
2.35%         Lin et al. [2014]*
2.38%         ReNet
2.47%         Goodfellow et al. [2013]*
2.8%          Zeiler and Fergus [2013]*

(c) SVHN


Table 2: Generalization errors obtained by the proposed ReNet along with those reported by
previous works on each of the three datasets. * denotes a convolutional neural network. We only
list the results reported by a single model, i.e., no ensembling of multiple models. In the case of
SVHN, we report results from models trained on the Format 2 (cropped digit) dataset only.


5 Results and Analysis 

In Table 2, we present the results on the three datasets, along with previously reported results.

It is clear that the proposed ReNet performs comparably to deep convolutional neural networks
which are the de facto standard for object recognition. This suggests that the proposed ReNet is a
viable alternative to convolutional neural networks (CNN), even on tasks where CNNs have
historically dominated. However, it is important to notice that the proposed ReNet does not outperform
state-of-the-art convolutional neural networks on any of the three benchmark datasets, which calls
for more research in the future.


6 Discussion 


Choice of Recurrent Units Note that the proposed architecture is independent of the chosen
recurrent units. We observed in preliminary experiments that gated recurrent units, either the GRU or
the LSTM, outperform a usual sigmoidal unit (affine transformation followed by an element-wise
sigmoid function). This indirectly confirms that the model utilizes long-term dependencies across
an input image, and the gated recurrent units help capture these dependencies.


Analysis of the Trained ReNet In this paper, we evaluated the proposed ReNet only
quantitatively. However, the accuracies on the test sets do not reveal what kind of image structures the
ReNet has captured in order to perform object recognition. Due to the large differences between
ReNet and LeNet discussed in Sec. 3, we expect that the internal behavior of ReNet will differ from
that of LeNet significantly. Further investigation along the line of [Zeiler and Fergus, 2014] will be
needed, as well as exploring ensembles which combine RNNs and CNNs for bagged prediction.
Computationally Efficient Implementation As discussed in Sec. 3, the proposed ReNet is less
parallelizable due to the sequential nature of the recurrent neural network (RNN). Although this
sequential nature cannot be addressed directly, our construction of ReNet allows the forward and
backward RNNs to be run independently from each other, which allows for parallel computation.
Furthermore, we can use many parallelization tricks widely used for training convolutional neural
networks such as parallelizing fully-connected layers [Krizhevsky, 2014], having separate sets of
kernels/features in different processors [Krizhevsky et al., 2012] and exploiting data parallelism.